Optimal Feature Selection
Automatic selection of optimal features from the noise
Linear models struggle with high-dimensional noisy data
Finding the perfect subset is unfortunately an NP-hard problem, making it computationally prohibitive for problems of practical size. Heuristic approaches exist, such as forward and backward stepwise variable selection, as well as regularization methods like Lasso and elastic net. However, experiments with both synthetic and real data show that these approximations fail to recover the exact solution and, as a result, often select many spurious features.
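For concreteness, the exact problem in question is best-subset selection under a cardinality constraint, stated here in the standard least-squares form with k denoting the number of features allowed:

$$\min_{\beta \in \mathbb{R}^p} \; \lVert y - X\beta \rVert_2^2 \quad \text{subject to} \quad \lVert \beta \rVert_0 \le k,$$

where $\lVert \beta \rVert_0$ counts the nonzero coefficients of $\beta$. The constraint forces a search over all $\binom{p}{k}$ possible supports, which is the source of the combinatorial difficulty.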
True variable selection with modern optimization
Compared to Lasso, Optimal Feature Selection cuts the number of falsely selected features in half while maintaining the same true positive rate. It is therefore more efficient at choosing the right variables, and the resulting model is simpler, more interpretable, and more accurate.
Optimal Feature Selection achieves the highest accuracy and a lower false alarm rate than Lasso in an experiment with 2,000 features, of which only 10 are relevant.
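To see the failure mode concretely, the sketch below mirrors that setup with scikit-learn's cross-validated Lasso on synthetic data. This is an illustrative experiment of our own, not the benchmark code behind the figure; the dimensions and noise level are assumptions. With settings like these, Lasso typically returns far more than the 10 true features.

```python
# Illustrative sketch: count how many irrelevant features a
# cross-validated Lasso selects on sparse synthetic data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, k = 500, 2000, 10                      # samples, candidate features, true features
X = rng.standard_normal((n, p))
true_support = rng.choice(p, size=k, replace=False)
beta = np.zeros(p)
beta[true_support] = rng.uniform(1.0, 2.0, size=k)
y = X @ beta + 0.5 * rng.standard_normal(n)  # noisy observations

lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

true_pos = len(set(selected) & set(true_support))
print(f"selected {len(selected)} features: {true_pos} true, "
      f"{len(selected) - true_pos} spurious")
```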
High interpretability and accuracy with a fraction of selected features
As the goal of this application was not only to predict accurately but also to learn the physical mechanism that affects the outcome, the simpler model stands out even more: it offers a small, practical set of the best variables for engineers to investigate further and act on to improve the outcome.
Comparing the performance of Optimal Feature Selection and Elastic Net across different numbers of selected features
Extremely fast and scalable
In particular, for industry-scale datasets with millions of observations and thousands of features, the heuristic option can find the optimal features in a matter of seconds. Moreover, experiments show that in the majority of cases studied the heuristic in fact finds the exact solution, so no tradeoff is needed when the data scale is extremely large.
Computational time comparisons. Both exact and approximate methods remain tractable as the number of samples increases.
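For a flavor of what a fast heuristic for the cardinality-constrained problem can look like, here is a minimal iterative-hard-thresholding sketch. This is a classical textbook heuristic shown purely for illustration; it is not the algorithm shipped in Optimal Feature Selection, which is described in the publications below.

```python
# Minimal iterative hard thresholding (IHT) sketch for
#   min ||y - X b||^2  subject to  ||b||_0 <= k.
import numpy as np

def iht(X, y, k, n_iter=200):
    n, p = X.shape
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        b = b + step * X.T @ (y - X @ b)         # gradient step on the least-squares loss
        keep = np.argsort(np.abs(b))[-k:]        # indices of the k largest coefficients
        mask = np.zeros(p, dtype=bool)
        mask[keep] = True
        b[~mask] = 0.0                           # hard-threshold everything else to zero
    return b
```

Each iteration costs only a few matrix-vector products, which is why heuristics of this flavor can scale to very large datasets.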
Related publications
Sparse high-dimensional regression: Exact scalable algorithms and phase transitions
Dimitris Bertsimas and Bart Van Parys
The Annals of Statistics, 2020
The original publication by our co-founder pioneering Optimal Feature Selection. This paper presents the first scalable approach to exact subset selection in linear regression and demonstrates superior empirical results compared to existing heuristic approaches.
Sparse Regression: Scalable algorithms and empirical performance
Dimitris Bertsimas, Jean Pauphilet and Bart Van Parys
Statistical Science, 2020
This paper develops an extremely scalable heuristic for solving the Optimal Feature Selection problem that is significantly faster than the original approach, without sacrificing performance.
Sparse classification and phase transitions: A discrete optimization perspective
Dimitris Bertsimas, Jean Pauphilet and Bart Van Parys
Preprint available on arXiv
This paper extends the Optimal Feature Selection methodology to classification problems, namely logistic regression and support vector machines. Experiments demonstrate its superior performance compared to alternative methods.